Effective Seed URL Selection and Scope Extension Algorithm for Web Crawler

Authors

Abstract

The Web is a huge, rapidly growing data source that contains data of every kind. Users rely on search engines to retrieve the data they want from this source, and search engines obtain that data with web crawlers. Crawlers follow the uniform resource locators (URLs) found in web pages, then fetch, parse, and index the content of every page they reach. The most important questions in the crawling process are which URLs to start from and what the scope of the crawl is. This paper presents seed URL selection and scope extension methods for a general-purpose crawler that has a crawl scope. For seed selection, three seed sets were built for 102 different countries based on the daily time visitors spend on a site, the number of page views per visitor, the percentage of traffic coming from search, and the total number of linking sites, and their performance was analyzed in detail. In addition, a new algorithm based on link score is proposed to extend the scope quickly; crawls were run with the seed sets, and the results were compared and analyzed.
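As a rough illustration of how the four site metrics named in the abstract could be combined to rank candidate seeds, here is a minimal sketch. It is not the paper's method: the `SiteStats` type, the `select_seeds` function, the min-max normalization, and the equal weights are all assumptions made for this example, since the abstract does not state how the metrics are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteStats:
    """Hypothetical per-site metrics; field names are assumptions."""
    url: str
    daily_time: float          # daily time spent per visitor (unit assumed)
    pages_per_visitor: float   # page views per visitor
    search_traffic_pct: float  # percentage of traffic coming from search
    linking_sites: int         # total number of sites linking in


def _min_max(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_seeds(sites, k, weights=(0.25, 0.25, 0.25, 0.25)):
    """Return the URLs of the k highest-scoring sites.

    The score is a weighted sum of the four normalized metrics.
    Equal weights are an assumption, not taken from the paper.
    """
    if not sites:
        return []
    columns = [
        _min_max([s.daily_time for s in sites]),
        _min_max([s.pages_per_visitor for s in sites]),
        _min_max([s.search_traffic_pct for s in sites]),
        _min_max([float(s.linking_sites) for s in sites]),
    ]
    scores = [
        sum(w * col[i] for w, col in zip(weights, columns))
        for i in range(len(sites))
    ]
    # Sort by score, highest first; ties keep their input order.
    ranked = sorted(zip(scores, sites), key=lambda pair: pair[0], reverse=True)
    return [site.url for _, site in ranked[:k]]
```

Calling `select_seeds(candidates, k=100)` on the candidate sites of one country would give one seed set for that country; changing `weights` is one way to derive alternative seed sets for comparison.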

Similar resources

URL Mining Using Web Crawler in Online Based Content Retrieval

A supervised web-scale forum crawler (Focus) is a forum crawling process carried out under supervision. The main aim of Focus is to crawl related content from the web with minimal overhead and to detect duplicate links. Forums can have different layouts or styles and are powered by a variety of forum software packages. Focus takes six paths from the entry page to the thread page. It helps the frequ...

An Effective Focused Web Crawler for Web Resource Discovery

Given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small. Web crawling is the process used by search engines to collect pages from the Web. Therefore, collecting domain-specific information from the Web is a special theme of research in many papers. In this paper, we introduce a new effective focused web crawler. It uses smart methods to ...

An Effective Deep Web Interfaces Crawler Framework Using Dynamic Web

SmartCrawler is an effective framework for harvesting deep web interfaces that aims to achieve both wide coverage and high efficiency for a focused crawler. Based on the observation that deep websites usually contain a few searchable forms and most of them are within a depth of three, our crawler is divided into two stages: site locating and in-site exploring. The site locating stage helps achieve wi...

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. An unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...

A Framework for Deep Web Crawler Using Genetic Algorithm

The Web has become one of the largest and most readily accessible repositories of human knowledge. Traditional search engines index only the surface Web, whose pages are easily found. The focus has now shifted to the invisible Web or hidden Web, which consists of a large warehouse of useful data such as images, sounds, presentations and many other types of media. To use such data, there is a need...

Journal

Journal title: International Journal of Advances in Engineering and Pure Sciences

Year: 2023

ISSN: 2636-8277

DOI: https://doi.org/10.7240/jeps.1174193